Web Scraping with R & rvest

Dr. Matthew Hendrickson

July 9, 2020

About Me

  • Dr. Matthew Hendrickson
  • Social Scientist by Training
    • Psychology & Music %>%
    • Cognitive & Social Psychology %>%
    • Law & Policy
  • Professional Experience (13+ years)
    • Higher Education Analyst
    • Independent Consultant
    • Research projects, data analysis, policy development, strategy, analytics pipeline solutions

Topics

  1. A Little About Web Scraping
  2. Robots!
  3. HTML & CSS
  4. The Setup
  5. Scraping the Data
  6. Assembling the Data
  7. References & Resources

A Little About Web Scraping

“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.” – Wikipedia


Web scraping is a flexible method for extracting numerical or textual data from the internet.

Use Cases

There are many uses for web scraping, including but not limited to:

  1. Price monitoring
  2. Time series tracking and analysis
  3. Sentiment analysis
  4. Brand monitoring
  5. Market analysis
  6. Lead generation

Robots!

  • No, not those robots!
  • Always ensure PRIOR to scraping that you have scraping rights!
  • This is critical - you can be blocked or even face legal action!

Robots.txt

Good news! You can easily check with the robotstxt package.

paths_allowed(paths = c("https://netflix.com/"))
#> [1] FALSE

FALSE means Netflix does not allow you to scrape its site.
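Rules in robots.txt apply per path and per bot, so a site can allow some paths and block others. The sketch below parses a made-up robots.txt offline through the text argument of robotstxt(); the rules and paths are illustrative assumptions, not any real site's file.

```r
library(robotstxt)

# Hypothetical robots.txt content, for illustration only
rules <- "User-agent: *\nDisallow: /private/"

# Parse the text directly instead of downloading it from a site
rt <- robotstxt(text = rules)

# Check two paths for any bot ("*"): the root and a path under /private/
allowed <- rt$check(paths = c("/", "/private/data"), bot = "*")
allowed
```

Checking the exact path you plan to request is safer than checking only the site root.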

HTML & CSS


Hyper Text Markup Language

“HTML is the standard markup language for creating Web pages.”



Cascading Style Sheets

“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.”

– W3Schools

HTML Structure

Image credit: Professor Shawn Santo

HTML Tags

HTML is structured with “tags,” indicating portions of a page.

Tags can be selected by name or by their place in the page structure.

Tags can be nested.

A few important tags (of many) for scraping:

  • <h1> header tags </h1>
  • <p> paragraph elements </p>
  • <ul> unordered bulleted list </ul>
  • <ol> ordered list </ol>
  • <li> individual list item </li>
  • <div> division </div>
  • <table> table </table>
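To see how tag names map to selections, here is a minimal sketch that parses a tiny made-up page from a string and selects elements by tag. The page content is an assumption for illustration.

```r
library(rvest)

# A tiny made-up page using some of the tags listed above
page <- read_html("
  <html><body>
    <h1>R Books</h1>
    <p>A short reading list.</p>
    <ul>
      <li>R for Data Science</li>
      <li>Advanced R</li>
    </ul>
  </body></html>")

# Select by tag name: the <h1> heading
heading <- page %>% html_nodes("h1") %>% html_text()

# Select nested tags: every <li> inside a <ul>
items <- page %>% html_nodes("ul li") %>% html_text()
```

Here heading holds the text of the h1 tag and items holds one string per list item.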

A Little Help with CSS

Extracting parts of a website can be daunting if unfamiliar with CSS.

SelectorGadget is helpful (Chrome only).

Inspecting the page elements is also helpful (most major browsers).

Scraping Methods

CSS - syntax is easier and aligns with HTML tags

XPath - useful when the node isn’t uniquely identified with CSS
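html_nodes() accepts either kind of selector. The sketch below uses made-up markup (the class names are assumptions) to select one node by CSS class and another by its text content, which standard CSS selectors do not support.

```r
library(rvest)

# Made-up markup: one span has a class, the other does not
page <- read_html('
  <div class="book">
    <span class="title">Advanced R</span>
    <span>Paperback</span>
  </div>')

# CSS selector: match by class
css_title <- page %>% html_nodes(".title") %>% html_text()

# XPath: match a span by the text it contains
xp_format <- page %>%
  html_nodes(xpath = "//span[contains(text(), 'Paperback')]") %>%
  html_text()
```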

The Setup

Set up the environment to scrape the site.

library(tidyverse)
library(robotstxt)
library(rvest)

That’s it!

Determine a website to scrape

Seems appropriate to pull R book data from Amazon.

paths_allowed(paths = c("https://amazon.com/"))
#> [1] TRUE


The site root is allowed, so we are good to scrape!

Specify the URL

Before you get started, you must specify the URL.

amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")

Data as of 2020-07-07

Titles

Scraping book titles

amazon %>% 
  html_nodes(".s-line-clamp-2") %>% 
  html_text() -> titles
head(titles)
#> [1] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n            \n        \n        \n    \n\n\n    \n"          
#> [2] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                The Book of R: A First Course in Programming and Statistics\n            \n        \n        \n    \n\n\n    \n"                     
#> [3] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Discovering Statistics Using R\n            \n        \n        \n    \n\n\n    \n"                                                  
#> [4] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R Graphics Cookbook: Practical Recipes for Visualizing Data\n            \n        \n        \n    \n\n\n    \n"                     
#> [5] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Advanced R, Second Edition (Chapman & Hall/CRC The R Series)\n            \n        \n        \n    \n\n\n    \n"                    
#> [6] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)\n            \n        \n        \n    \n\n\n    \n"

The scraped text includes many line breaks and blank spaces.

Let’s clean this up with str_trim.

Removing \n and white space from the titles

titles <- str_trim(titles) # Removes leading & trailing space
head(titles)
#> [1] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"          
#> [2] "The Book of R: A First Course in Programming and Statistics"                     
#> [3] "Discovering Statistics Using R"                                                  
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"                     
#> [5] "Advanced R, Second Edition (Chapman & Hall/CRC The R Series)"                    
#> [6] "Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)"
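As an alternative to a separate str_trim() step, html_text() has a trim argument that strips leading and trailing whitespace while extracting. A minimal sketch on made-up markup that mimics the padding seen above:

```r
library(rvest)

# Made-up markup with padding similar to the scraped titles
page <- read_html('<span class="s-line-clamp-2">\n    \n        Advanced R\n    \n</span>')

# trim = TRUE removes leading and trailing whitespace during extraction
trimmed <- page %>%
  html_nodes(".s-line-clamp-2") %>%
  html_text(trim = TRUE)
```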

Formats

Scraping the book format

amazon %>% 
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>% 
  html_text() -> format
head(format)
#> [1] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [2] "\n    \n        \n        \n            Kindle\n        \n    \n"   
#> [3] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [4] "\n    \n        \n        \n            eTextbook\n        \n    \n"
#> [5] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [6] "\n    \n        \n        \n            Kindle\n        \n    \n"

Clean up book format values

format <- str_trim(format)
head(format)
#> [1] "Paperback" "Kindle"    "Paperback" "eTextbook" "Paperback" "Kindle"

Price

Scraping the book price

The price structure splits price into two elements. We must pull each and combine them into a single price.

amazon %>% 
  html_nodes(".a-price-whole") %>% 
  html_text() -> price_whole
head(price_whole)
#> [1] "40." "24." "33." "29." "34." "61."

Scraping (the rest of) the book price

amazon %>% 
  html_nodes(".a-price-fraction") %>% 
  html_text() -> price_fraction
head(price_fraction)
#> [1] "10" "99" "04" "99" "37" "60"

Combine price portions

price <- paste(price_whole, price_fraction, sep = "")
head(price)
#> [1] "40.10" "24.99" "33.04" "29.99" "34.37" "61.60"

Make it numeric

price <- as.numeric(price)
head(price)
#> [1] 40.10 24.99 33.04 29.99 34.37 61.60
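as.numeric() returns NA for strings that contain a thousands separator, and paste() silently recycles if the two parts differ in length. The sketch below guards against both; the values are made up, and ex_whole and ex_fraction are hypothetical stand-ins for the scraped vectors.

```r
# Hypothetical price parts, including one with a thousands separator
ex_whole    <- c("40.", "1,299.", "24.")
ex_fraction <- c("10", "00", "99")

# Each whole part needs exactly one matching fraction
stopifnot(length(ex_whole) == length(ex_fraction))

# Combine, drop commas, then convert
ex_price <- as.numeric(gsub(",", "", paste0(ex_whole, ex_fraction), fixed = TRUE))
```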

Rating

Scraping the book rating

amazon %>% 
  html_nodes("i.a-icon.a-icon-star-small.aok-align-bottom") %>% 
  html_text() -> rating
head(rating)
#> [1] "4.7 out of 5 stars" "4.3 out of 5 stars" "4.5 out of 5 stars"
#> [4] "4.7 out of 5 stars" "4.8 out of 5 stars" "4.4 out of 5 stars"

Let’s trim this into a usable metric

rating <- substr(rating, 1, 3)
head(rating)
#> [1] "4.7" "4.3" "4.5" "4.7" "4.8" "4.4"

Make it numeric

rating <- as.numeric(rating)
head(rating)
#> [1] 4.7 4.3 4.5 4.7 4.8 4.4
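substr(rating, 1, 3) relies on every rating having the form "d.d". A pattern-based extraction is a little more tolerant. This is a sketch with made-up values, including a hypothetical rating written without a decimal:

```r
library(stringr)

# Made-up rating strings; the second has no decimal part
ex_rating_raw <- c("4.7 out of 5 stars", "5 out of 5 stars", "4.3 out of 5 stars")

# Take the leading number, with or without a decimal part
ex_rating <- as.numeric(str_extract(ex_rating_raw, "^[0-9]+(\\.[0-9]+)?"))
```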

Rating Counts

Scraping the book rating count

This element is messier and we’ll need a number of cleaning steps.

amazon %>% 
  html_nodes("div.a-row.a-size-small") %>% 
  html_text() -> rate_n
head(rate_n)
#> [1] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427\n            \n        \n        \n    \n\n\n\n"
#> [2] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76\n            \n        \n        \n    \n\n\n\n" 
#> [3] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255\n            \n        \n        \n    \n\n\n\n"
#> [4] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n" 
#> [5] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.8 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                31\n            \n        \n        \n    \n\n\n\n" 
#> [6] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.4 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n"

Trim the rating count

rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427"
#> [2] "4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76" 
#> [3] "4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255"
#> [4] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14" 
#> [5] "4.8 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                31" 
#> [6] "4.4 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14"

Rating count - substring

rate_n <- str_sub(rate_n, -5)
head(rate_n)
#> [1] "  427" "   76" "  255" "   14" "   31" "   14"

Trim the rating count (again)

rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "427" "76"  "255" "14"  "31"  "14"

Set as numeric

rate_n <- as.numeric(rate_n)
head(rate_n)
#> [1] 427  76 255  14  31  14
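The trim, substring, trim sequence assumes the count fits in the last five characters. A sketch of a pattern-based alternative that takes the trailing run of digits, however long, and also tolerates commas; the input strings are made up to resemble the scraped text:

```r
library(stringr)

# Made-up strings shaped like the scraped rating rows
ex_raw <- c("4.7 out of 5 stars\n    \n        427",
            "4.3 out of 5 stars\n    \n        1,204",
            "4.4 out of 5 stars\n\n   14  \n")

# Trim, then keep the digits (and commas) at the very end of each string
ex_count_chr <- str_extract(str_trim(ex_raw), "[0-9,]+$")

# Remove commas before converting to numeric
ex_count <- as.numeric(str_remove_all(ex_count_chr, ","))
```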

Publication Date

Scraping the book publication date

amazon %>% 
  html_nodes("span.a-size-base.a-color-secondary.a-text-normal") %>% 
  html_text() -> pub_dt
head(pub_dt)
#> [1] "Jan 10, 2017" "Jul 16, 2016" "Apr 5, 2012"  "Nov 30, 2018" "May 30, 2019"
#> [6] "Dec 5, 2018"

Convert to a date

pub_dt <- as.Date(pub_dt, "%b %d, %Y")
head(pub_dt)
#> [1] "2017-01-10" "2016-07-16" "2012-04-05" "2018-11-30" "2019-05-30"
#> [6] "2018-12-05"
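as.Date() returns NA, without an error, for any value that does not match the format, and %b matches abbreviated month names in the current locale. A small sketch with made-up values shows one way to list what failed to parse before moving on:

```r
# Hypothetical scraped values; the last one is not a date
ex_pub_raw <- c("Jan 10, 2017", "Apr 5, 2012", "Coming soon")

ex_pub <- as.Date(ex_pub_raw, format = "%b %d, %Y")

# Values that failed to parse come back as NA; collect the originals
unparsed <- ex_pub_raw[is.na(ex_pub)]
unparsed
```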

We Have the Pieces

Let’s assemble the file!

  1. Titles
  2. Formats
  3. Prices
  4. Ratings
  5. Rating Counts
  6. Publication Date

Let’s check the scrapes

length(titles)
#> [1] 16
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 14
length(rate_n)
#> [1] 14
length(pub_dt)
#> [1] 16

Wait! What?!?

What Happened?

A common issue with scraping is that the vectors come back with different lengths: some records are missing an element (a rating, for example), while others have several (multiple formats).

We can fix this!

…manually…
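One way to reduce this problem, sketched here under assumptions: select each record's container first, then pull one child per container with html_node(). Containers without a match yield NA, so every vector has one entry per container. The markup and class names (.book, .title, .rating) are made up for illustration and are not Amazon's.

```r
library(rvest)

# Made-up results page: the second book has no rating
page <- read_html('
  <div class="book"><span class="title">Book A</span><span class="rating">4.7</span></div>
  <div class="book"><span class="title">Book B</span></div>
  <div class="book"><span class="title">Book C</span><span class="rating">4.3</span></div>')

# One node per record
books <- html_nodes(page, ".book")

# html_node() returns exactly one result per container (missing if absent)
book_titles  <- books %>% html_node(".title") %>% html_text()
book_ratings <- books %>% html_node(".rating") %>% html_text() %>% as.numeric()
```

Both vectors have length 3, with NA in the rating slot for the book that has none.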

Fixing the Scrapes

Fixing Titles

All titles were populated and scraped accurately. However, because each book is listed in multiple formats, each title must be repeated to line up with the format and price rows.

titles %>% 
  rep(each = 2) -> titles
length(titles)
#> [1] 32

Fixing Titles

Some titles have more than 2 formats.

titles %>% 
  append(values = titles[15], after = 15) %>% 
  append(values = titles[11], after = 11) %>% 
  append(values = titles[9], after = 9) %>% 
  append(values = titles[5], after = 5) -> titles
length(titles)
#> [1] 36
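If the number of formats per title is known, rep() with times can handle the repetition and the extra copies in one step. A minimal sketch with made-up titles and counts (n_formats is a hypothetical vector, not scraped data):

```r
# Made-up titles and the number of formats listed for each
ex_titles <- c("Book A", "Book B", "Book C")
n_formats <- c(2, 3, 2)

# Repeat each title once per format
ex_expanded <- rep(ex_titles, times = n_formats)
ex_expanded
```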

Fixing Formats

Nothing needed here!

length(format)
#> [1] 36

Fixing Prices

Or here!

length(price)
#> [1] 36

Fixing Ratings

Two books don’t have ratings.

rating %>% 
  append(values = NA, after = 7) %>% 
  append(values = NA, after = 11) -> rating
length(rating)
#> [1] 16
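append() inserts values after a given position, and each call in a chain sees the vector as it stands after the previous call. A small sketch with made-up ratings shows that the second position is counted in the already-lengthened vector:

```r
# Made-up ratings
x <- c(4.7, 4.3, 4.5, 4.8)

# Insert NA after position 2 of x
y <- append(x, values = NA, after = 2)

# Insert NA after position 4 of y (not of x)
z <- append(y, values = NA, after = 4)
z
```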

Fixing Ratings

Like titles, the ratings need to be repeated.

The same corrections are done here.

rating %>% 
  rep(each = 2) -> rating
length(rating)
#> [1] 32

Fixing Ratings

Books with more than 2 formats.

rating %>% 
  append(values = rating[15], after = 15) %>% 
  append(values = rating[11], after = 11) %>% 
  append(values = rating[9], after = 9) %>% 
  append(values = rating[5], after = 5) -> rating
length(rating)
#> [1] 36

Fixing Rating Counts

Titles without a rating also lack a rating count.

rate_n %>% 
  append(values = NA, after = 7) %>% 
  append(values = NA, after = 11) -> rate_n
length(rate_n)
#> [1] 16

Fixing Rating Counts

Like titles, the rating counts need to be repeated.

The same corrections are done here.

rate_n %>% 
  rep(each = 2) -> rate_n
length(rate_n)
#> [1] 32

Fixing Rating Counts

Books with more than 2 formats.

rate_n %>% 
  append(values = rate_n[15], after = 15) %>% 
  append(values = rate_n[11], after = 11) %>% 
  append(values = rate_n[9], after = 9) %>% 
  append(values = rate_n[5], after = 5) -> rate_n
length(rate_n)
#> [1] 36

Fixing Publication Date

Like titles, the publication dates need to be repeated.

pub_dt %>% 
  rep(each = 2) -> pub_dt
length(pub_dt)
#> [1] 32

Fixing Publication Date

Books with more than 2 formats.

pub_dt %>% 
  append(values = pub_dt[15], after = 15) %>% 
  append(values = pub_dt[11], after = 11) %>% 
  append(values = pub_dt[9], after = 9) %>% 
  append(values = pub_dt[5], after = 5) -> pub_dt
length(pub_dt)
#> [1] 36

One Last Check!

length(titles)
#> [1] 36
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 36
length(rate_n)
#> [1] 36
length(pub_dt)
#> [1] 36
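The same check can be automated so a mismatch stops the script. A minimal sketch, assuming the vectors are gathered into a named list (the values here are placeholders, not the scraped data):

```r
# Placeholder vectors standing in for the scraped columns
ex_cols <- list(title  = c("Book A", "Book B"),
                price  = c(40.10, 24.99),
                rating = c(4.7, NA))

# Length of each element of the list
n <- lengths(ex_cols)

# Stop with an error unless every column has the same length
stopifnot(length(unique(n)) == 1)
```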

(Finally) Assemble the Data

r_books <- tibble(title            = titles,
                  text_format      = format,
                  price            = price,
                  rating           = rating,
                  num_ratings      = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>          
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10      
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10      
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16      
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16      
#> 5 Discovering Statistics ~ Paperback    34.4    4.5         255 2012-04-05      
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05

References & Resources

Thank you


@mjhendrickson


matthewjhendrickson


mjhendrickson


Web Scraping in R & rvest on GitHub

This talk is freely distributed under the MIT License.